There is a large, diverse literature on constructing indexes of what amounts to "socioeconomic status" for spatial areas such as neighborhoods. These indexes can then be used to investigate correlations between socioeconomic conditions and health outcomes, infrastructure access, environmental exposures, or other variables, or to target particularly vulnerable neighborhoods for interventions. Many variables can contribute to such an index, but it is always difficult to choose which variables to include and how to weight them. For example, the USEPA EJSCREEN project tracks six demographic variables but uses only two when constructing demographic indices for equity-weighting environmental exposures: a simple average of the % of the population that is low-income and the % that is "minority" (all individuals other than non-Hispanic white alone).
One method that has been popular in social science and public health policy is Principal Components Analysis (PCA). See this study for a foundational example. PCA creates linear combinations of a set of variables that explain the maximum amount of the overall variance in the dataset. This amounts to up-weighting the variables that have the widest spread and that covary with the other variables. In this way, any combination of variables of interest to policymakers, planners, or analysts can be chosen and collapsed into one or more indices in an automated manner. This approach has the added value of being customizable for different communities, where different axes of inequity may be more or less important or impactful than in others.
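The core idea (standardize the chosen variables, run PCA, and take the first principal component as a single index) can be sketched on simulated data. The variables and coefficients below are invented for illustration only:

```r
# A minimal sketch of PCA-based index construction on simulated data.
set.seed(1)
n <- 100
poverty <- rnorm(n)                              # simulated standardized % in poverty
no_hs   <- 0.7 * poverty + rnorm(n, sd = 0.5)    # correlated: % without HS diploma
unemp   <- 0.6 * poverty + rnorm(n, sd = 0.6)    # correlated: % unemployed
dat <- data.frame(poverty, no_hs, unemp)

pca <- prcomp(dat, center = TRUE, scale. = TRUE) # scale so units do not dominate
index <- pca$x[, 1]                              # PC1: one weighted combination
summary(pca)$importance["Proportion of Variance", "PC1"]
```

Because the three variables covary, PC1 captures most of their shared variance and can stand in for all of them as one number per unit.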
In the United States, this has typically been done to identify regions where socioeconomic conditions correlate with cancer incidence. The typical source of variables is the U.S. Census Bureau's American Community Survey (ACS), which includes detailed demographic information on race/ethnicity, education levels, income, occupation, and housing quality, among other topics.
Two or more of the following variables are often used:
Some studies using them are listed below:
First, we take an example service area boundary: in this case, the Buffalo municipal boundary. This could come from an external API, but here it comes from a local file. We re-project it to our working projection (`prj`) to begin with.
boundary <- sf::read_sf("data/geospatial_data/COB_municipal_boundary.shp") %>% st_transform(prj)
mapview(boundary)
Then we find the relevant Census Block Groups (orange) and Tracts (green), first downloading the entire state and then subsetting to the utility service area (blue).
bg <- get_acs(state = state, geography = "block group", variables = c("B01001_001","B11016_001"), geometry = TRUE, output="wide") %>% st_transform(prj)
## Getting data from the 2015-2019 5-year ACS
tr <- get_acs(state = state, geography = "tract", variables = c("B01001_001","B11016_001"), geometry = TRUE, output="wide") %>% st_transform(prj)
## Getting data from the 2015-2019 5-year ACS
bg <- bg[lengths(st_intersects(bg, boundary)) > 0, ]
tr <- tr[lengths(st_intersects(tr, boundary)) > 0, ]
mapview(boundary, layer.name=name, map.types= "OpenStreetMap") +
mapview(bg, alpha.regions=0, color="orange", col.regions="orange",lwd=2, layer.name="Block Groups", map.types= "OpenStreetMap") +
mapview(tr, alpha.regions=0, color="green",col.regions="green", lwd=3, layer.name="Tracts", map.types= "OpenStreetMap")
Now we download the necessary census data and match them to the appropriate census boundaries.
vars <- tidycensus::load_variables(2019, "acs5", cache = TRUE)
cv <- c(pop_race_count = "B03002_001",
pop_nonhispanic_white_alone = "B03002_003",
pop_educ_attainment_count = "B15003_001",
pop_educ_none = "B15003_002", #0
pop_educ_nursery = "B15003_003", #0
pop_educ_kindergarten = "B15003_004", #0
pop_educ_grade1 = "B15003_005", #1
pop_educ_grade2 = "B15003_006", #2
pop_educ_grade3 = "B15003_007", #3
pop_educ_grade4 = "B15003_008", #4
pop_educ_grade5 = "B15003_009", #5
pop_educ_grade6 = "B15003_010", #6
pop_educ_grade7 = "B15003_011", #7
pop_educ_grade8 = "B15003_012", #8
pop_educ_grade9 = "B15003_013", #9
pop_educ_grade10 = "B15003_014", #10
pop_educ_grade11 = "B15003_015", #11
pop_educ_grade12_nodiploma = "B15003_016", #11.5
pop_educ_HSdiploma = "B15003_017", #12
pop_educ_GED = "B15003_018", #12
pop_educ_college_less_1year = "B15003_019", #13
pop_educ_college_more_1year_nodegree = "B15003_020", #13.5
pop_educ_assoc = "B15003_021", #14
pop_educ_bachelor = "B15003_022", #16
pop_educ_master = "B15003_023", #18
pop_educ_prof = "B15003_024", #19
pop_educ_doc = "B15003_025", # 21
poverty_level_total = "C17002_001",
poverty_level_gr_200FPL = "C17002_008" ,
hh_income_20ptile = "B19080_001",
hh_income_median = "B19013_001",
pop_hh_income_assistance_total = "B09010_001",
pop_in_hh_with_income_assistance = "B09010_002",
per_capita_income = "B19301_001",
pop_labor_force = "B23025_001",
pop_unemployed = "B23025_005",
housing_units_count = "B25002_001",
housing_units_vacant = "B25002_003",
housing_units_occupied = "B25002_002",
pop_units_count = "B25033_001",
pop_owner_occupied_units = "B25033_002",
pop_rental_units = "B25033_008",
housing_units_plumbing_count = "B25050_001",
housing_units_plumbing_complete = "B25050_002",
rent_median = "B25058_001",
median_gross_rent_percent_hh_income = "B25071_001",
home_value_median = "B25077_001",
hh_owner_occupied = "B25014_002",
hh_owner_occupied_1.5_2_perRoom = "B25014_006",
hh_owner_occupied_gr2.0_perRoom = "B25014_007",
hh_renters = "B25014_008",
hh_renters_1.5_2_perRoom = "B25014_012",
hh_renters_gr2.0_perRoom = "B25014_013",
hh_lang_total = "C16002_001",
hh_lang_ltd_api ="C16002_010",
hh_lang_ltd_otherio = "C16002_007",
hh_lang_ltd_spanish = "C16002_004",
hh_lang_ltd_other = "C16002_013",
pop_age_total = "B01001_001",
pop_age_male_65_66 = "B01001_020",
pop_age_male_67_69 = "B01001_021",
pop_age_male_70_74 = "B01001_022",
pop_age_male_75_79 = "B01001_023",
pop_age_male_80_84 = "B01001_024",
pop_age_male_85up = "B01001_025",
pop_age_female_65_66 = "B01001_044",
pop_age_female_67_69 = "B01001_045",
pop_age_female_70_74 = "B01001_046",
pop_age_female_75_79 = "B01001_047",
pop_age_female_80_84 = "B01001_048",
pop_age_female_85up = "B01001_049",
occupation_total = "C24010_001",
occupation_sales_office_male = "C24010_027",
occupation_sales_office_female = "C24010_063",
occupation_service_male = "C24010_019",
occupation_service_female = "C24010_055",
occupation_protective_service_male = "C24010_021",
occupation_protective_service_female = "C24010_057",
occupation_natresource_male = "C24010_030",
occupation_natresource_female = "C24010_066",
occupation_prodtrans_male = "C24010_034",
occupation_prodtrans_female = "C24010_070",
hh_vehicle_count = "B25044_001",
hh_owner_noVehicle = "B25044_003",
hh_renter_noVehicle = "B25044_010",
families_total = "B11003_001",
families_singleparent_female = "B11003_016",
families_singleparent_male = "B11003_010"
)
counties <- tidycensus::get_acs(year=2019,
geography = "county",
variables = "B01001_001",
geometry = TRUE) %>%
sf::st_transform(prj) %>%
dplyr::filter(lengths(sf::st_intersects(., boundary)) > 0)
## Getting data from the 2015-2019 5-year ACS
# Find FIPS code for relevant state and counties, make sure no duplicates
st <- unique(substr(counties$GEOID,1,2))
ct <- unique(substr(counties$GEOID,3,5))
data.tr <- get_acs(year=2019,
geography = "tract",
variables = cv,
geometry = TRUE,
state = st,
county=ct,
output="wide") %>%
filter(GEOID %in% tr$GEOID)
## Getting data from the 2015-2019 5-year ACS
data.bg <- get_acs(year=2019,
geography = "block group",
variables = cv,
geometry = TRUE,
state = st,
county=ct,
output="wide") %>%
filter(GEOID %in% bg$GEOID)
## Getting data from the 2015-2019 5-year ACS
Now we construct the variables for our PCA at the Tract and Block Group level.
pcaData.bg <- data.bg %>%
mutate(percBelow200FPL = 100*(1-poverty_level_gr_200FPLE/poverty_level_totalE),
percEducLessHS = 100*(1-(pop_educ_HSdiplomaE +
pop_educ_GEDE +
pop_educ_college_less_1yearE +
pop_educ_college_more_1year_nodegreeE +
pop_educ_assocE +
pop_educ_bachelorE +
pop_educ_masterE +
pop_educ_profE +
pop_educ_docE)/pop_educ_attainment_countE),
percMinority = 100*(1-pop_nonhispanic_white_aloneE/pop_race_countE),
income_median = hh_income_medianE,
income_20tile = hh_income_20ptileE,
income_percapita = per_capita_incomeE,
percAssistance = 100*pop_in_hh_with_income_assistanceE /pop_hh_income_assistance_totalE,
rentMedian = rent_medianE,
rentIncomeRatioMedian = median_gross_rent_percent_hh_incomeE,
homeValuemedian = home_value_medianE,
percHomeowners = 100*pop_owner_occupied_unitsE/pop_units_countE,
percBlueCollar = 100*(occupation_sales_office_maleE +
occupation_sales_office_femaleE +
occupation_service_maleE +
occupation_service_femaleE -
occupation_protective_service_maleE -
occupation_protective_service_femaleE +
occupation_natresource_maleE +
occupation_natresource_femaleE +
occupation_prodtrans_maleE +
occupation_prodtrans_femaleE)/occupation_totalE,
percUnemployed = 100*pop_unemployedE/pop_labor_forceE,
percPlumbingComplete = 100*housing_units_plumbing_completeE/housing_units_plumbing_countE,
percFamiliesSingleParent = 100*(families_singleparent_femaleE + families_singleparent_maleE)/families_totalE,
percHH_NoVehicle = 100*(hh_owner_noVehicleE+hh_renter_noVehicleE)/hh_vehicle_countE,
percAgeOver64 = 100*(pop_age_male_65_66E +
pop_age_male_67_69E +
pop_age_male_70_74E +
pop_age_male_75_79E +
pop_age_male_80_84E +
pop_age_male_85upE +
pop_age_female_65_66E +
pop_age_female_67_69E +
pop_age_female_70_74E +
pop_age_female_75_79E +
pop_age_female_80_84E +
pop_age_female_85upE)/pop_age_totalE,
percHH_LangIsolated = 100 * (hh_lang_ltd_apiE +
hh_lang_ltd_otherioE +
hh_lang_ltd_spanishE +
hh_lang_ltd_otherE)/hh_lang_totalE,
percHH_Crowded = 100 * (hh_owner_occupied_1.5_2_perRoomE +
hh_owner_occupied_gr2.0_perRoomE +
hh_renters_1.5_2_perRoomE +
hh_renters_gr2.0_perRoomE)/(hh_owner_occupiedE + hh_rentersE)) %>%
select(GEOID,percBelow200FPL:percHH_Crowded) %>%
st_drop_geometry()
pcaData.tr <- data.tr %>%
mutate(percBelow200FPL = 100*(1-poverty_level_gr_200FPLE/poverty_level_totalE),
percEducLessHS = 100*(1-(pop_educ_HSdiplomaE +
pop_educ_GEDE +
pop_educ_college_less_1yearE +
pop_educ_college_more_1year_nodegreeE +
pop_educ_assocE +
pop_educ_bachelorE +
pop_educ_masterE +
pop_educ_profE +
pop_educ_docE)/pop_educ_attainment_countE),
percMinority = 100*(1-pop_nonhispanic_white_aloneE/pop_race_countE),
income_median = hh_income_medianE,
income_20tile = hh_income_20ptileE,
income_percapita = per_capita_incomeE,
percAssistance = 100*pop_in_hh_with_income_assistanceE /pop_hh_income_assistance_totalE,
rentMedian = rent_medianE,
rentIncomeRatioMedian = median_gross_rent_percent_hh_incomeE,
homeValuemedian = home_value_medianE,
percHomeowners = 100*pop_owner_occupied_unitsE/pop_units_countE,
percBlueCollar = 100*(occupation_sales_office_maleE +
occupation_sales_office_femaleE +
occupation_service_maleE +
occupation_service_femaleE -
occupation_protective_service_maleE -
occupation_protective_service_femaleE +
occupation_natresource_maleE +
occupation_natresource_femaleE +
occupation_prodtrans_maleE +
occupation_prodtrans_femaleE)/occupation_totalE,
percUnemployed = 100*pop_unemployedE/pop_labor_forceE,
percPlumbingComplete = 100*housing_units_plumbing_completeE/housing_units_plumbing_countE,
percFamiliesSingleParent = 100*(families_singleparent_femaleE + families_singleparent_maleE)/families_totalE,
percHH_NoVehicle = 100*(hh_owner_noVehicleE+hh_renter_noVehicleE)/hh_vehicle_countE,
percAgeOver64 = 100*(pop_age_male_65_66E +
pop_age_male_67_69E +
pop_age_male_70_74E +
pop_age_male_75_79E +
pop_age_male_80_84E +
pop_age_male_85upE +
pop_age_female_65_66E +
pop_age_female_67_69E +
pop_age_female_70_74E +
pop_age_female_75_79E +
pop_age_female_80_84E +
pop_age_female_85upE)/pop_age_totalE,
percHH_LangIsolated = 100 * (hh_lang_ltd_apiE +
hh_lang_ltd_otherioE +
hh_lang_ltd_spanishE +
hh_lang_ltd_otherE)/hh_lang_totalE,
percHH_Crowded = 100 * (hh_owner_occupied_1.5_2_perRoomE +
hh_owner_occupied_gr2.0_perRoomE +
hh_renters_1.5_2_perRoomE +
hh_renters_gr2.0_perRoomE)/(hh_owner_occupiedE + hh_rentersE)) %>%
select(GEOID,percBelow200FPL:percHH_Crowded) %>%
st_drop_geometry()
Now, we inspect the distributions of the variables to see which ones may be problematic. It appears that the following variables are highly skewed and should be transformed:
It also appears that the following variables are not available or otherwise highly missing at the block group level:
If clients are determined to use these variables, tract-level analysis may be necessary, which may not work for custom geographies that must be assembled from units smaller than tracts.
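For the skewed variables, a simple log transform before the PCA can pull in long right tails. A hedged sketch, where the variable names chosen are illustrative rather than the document's actual skewed list:

```r
# Sketch: log1p-transform right-skewed variables before PCA.
# The variables named here are illustrative examples only.
library(dplyr)

transform_skewed <- function(df, vars) {
  mutate(df, across(all_of(vars), log1p))  # log(1 + x) keeps zeros finite
}

x <- data.frame(homeValuemedian = c(0, 100, 250000),
                rentMedian      = c(500, 800, 3000))
transform_skewed(x, c("homeValuemedian", "rentMedian"))
```

Because PCA weights high-variance directions, untransformed skewed variables can dominate the components; transforming them first makes the loadings easier to interpret.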
We cannot use the public assistance variable at the block group level, but the 20th percentile of income can be approximated by other means, and median home value can be imputed.
We use a random forest missing-value imputation algorithm, missForest (see here), to fill in missing data for the purposes of later indexing. Compare median home values in the unimputed vs. imputed datasets:
imp.bg <- dplyr::select(pcaData.bg,-GEOID,-income_20tile,-percAssistance)
imp.bg <- missForest(imp.bg)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
imp.bg <- imp.bg$ximp
imp.bg$GEOID <- pcaData.bg$GEOID
missing <- select(pcaData.bg,GEOID,homeValuemedian)
imputed <- select(imp.bg,GEOID,homeValuemedian)
imp.demo <- left_join(bg,missing,by="GEOID")
imp.demo <- left_join(imp.demo,imputed,by="GEOID")
map.miss <- mapview(imp.demo,zcol="homeValuemedian.x",layer.name="Median Home value (NA)")
map.imp <- mapview(imp.demo,zcol="homeValuemedian.y", layer.name="Median Home Value (imputed)")
sync(map.miss,map.imp)
With all missing data imputed, we can now demonstrate how U.S. Census data can be re-estimated for arbitrary larger geographies. Below, we take the example of the City of Buffalo's Council District boundaries, compared with Census Block Groups.
alt <- sf::read_sf("data/geospatial_data/COB_council_distrcits.shp") %>% st_transform(prj)
mapview(bg, layer.name= "Census Block Groups") + mapview(alt, alpha.regions=0, col.regions="darkslategrey", lwd=3, color="darkslategrey", layer.name="Alt. Polygons (Neighborhoods)") + mapview(boundary, alpha.regions=0, lwd=2, color="black", col.regions="black",layer.name = name)
In this case, Buffalo's alternative polygons appear to be mostly combinations of Census Block Groups, with minor discrepancies at the borders. Other alternative polygons may differ substantially, so we implement a general area-weighting (areal interpolation) algorithm to re-estimate our census variables for the alternative polygons. The principle is the same as that documented here.
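The principle can be illustrated with a toy example (the numbers are invented): a block group's population is apportioned to each alternative polygon in proportion to the share of the block group's area that falls inside it.

```r
# Toy area weighting: a block group of 1,000 people is split 30%/70%
# by area between hypothetical alternative polygons A and B.
inter <- data.frame(
  alt    = c("A", "B"),
  share  = c(0.3, 0.7),    # share of the block group's area in each polygon
  bg_pop = c(1000, 1000)   # the source block group's total population
)
inter$pop <- inter$share * inter$bg_pop  # population apportioned by area
inter$pop        # 300 and 700
sum(inter$pop)   # 1000: the block group's population is conserved
```

The code below does the same thing with real geometries, using household counts as well as population so that household-level variables can be weighted appropriately.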
First, we compute the area of the block groups (projecting to the appropriate state plane projection).
bg <- bg %>%
st_transform(prj) %>% # project to NAD83/New York West Projection
mutate(area_bg = st_area(.))# compute area
Then we compute the area of the alternative polygons (projecting to the appropriate state plane projection).
alt <- alt %>%
st_transform(prj) %>% # project to NAD83/New York West Projection
mutate(area_alt = st_area(.))# compute area
Now we compute the spatial intersection of the alternate geometry with the block groups
alt.bg <- sf::st_intersection(alt,bg)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
alt.bg <- alt.bg %>%
mutate(area_alt_bg = st_area(.), #calculate area of intersection of block group and alternative polygon
share_area = area_alt_bg/area_alt, #calculate share of area of each block group in each custom polygon
pop = share_area * B01001_001E, #calculate population of each intersection
hh = share_area * B11016_001E) %>% #calculate hh of each intersection
group_by(DIST_ID) %>% #group calculations by alternate polygon
mutate(alt_pop = sum(pop), # calculate population of alternate polygons
alt_hh = sum(hh) #calculate hh in alternate polygons
) %>%
ungroup() %>% #ungroup calculations
mutate(wgt_pop = pop/alt_pop, #calculate prop of population of alternate polygon represented by each intersection
wgt_hh = hh/alt_hh) #calculate prop of hh of alternate polygon represented by each intersection
Now we can weight every variable by hh and pop weights as appropriate to come up with the alternative polygon measures.
imp.bg.long <- pivot_longer(imp.bg,!GEOID,names_to = "var",values_to = "value")
imp.bg.long <- left_join(imp.bg.long,alt.bg,by="GEOID")
imp.bg.long$value <- imp.bg.long$value * imp.bg.long$wgt_hh
imp.bg.long <- select(imp.bg.long,!!sym(alt_name),var,value)
imp.neigh.long <- imp.bg.long %>%
group_by(!!sym(alt_name),var) %>% #change depending on alt polygon name
summarise(value=sum(value))
## `summarise()` has grouped output by 'District'. You can override using the `.groups` argument.
alt_polygons <- alt %>%
left_join(imp.neigh.long,by=as.character(sym(alt_name))) %>%
select(!!sym(alt_name),var,value) %>%
mutate(alt_polygon_id = !!sym(alt_name)) %>%
select(alt_polygon_id,var,value)
v <- as.data.frame(unique(alt_polygons$var))
v$units = c("USD",
"USD",
"USD",
"%",
"%",
"%",
"%",
"%",
"%",
"%",
"%",
"%",
"%",
"%",
"%",
"ratio",
"USD"
)
colnames(v) <- c("var","units")
alt_polygons <- na.omit(alt_polygons)
alt_data <- st_drop_geometry(alt_polygons)
alt_data <- left_join(alt_data,v,by="var")
alt_data$value <- as.numeric(alt_data$value)
alt_polygons <- distinct(select(alt_polygons,alt_polygon_id))
alt_polygons <- st_transform(alt_polygons,4326)
bg_data <- pivot_longer(imp.bg,!GEOID,names_to = "var",values_to = "value")
bg_data <- left_join(bg_data,v,by="var")
bg_polygons <- select(bg,GEOID)
boundary.p <- st_transform(boundary,prj) #change based on state plane
bg_polygons <- st_intersection(bg_polygons,boundary.p)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
bg_polygons <- select(bg_polygons,GEOID)
bg_polygons <- st_transform(bg_polygons,4326)
bg_data$var <- paste0("census_",bg_data$var)
alt_data$var <- paste0("census_",alt_data$var)
# skim(pcaData.tr)
# skim(pcaData.bg)
With this knowledge, we conduct the PCA, first on all the available variables for Block Groups and then for the alternative polygons (Districts). Below, we report the first three principal components of each PCA.
bg_data_wide <- bg_data %>% pivot_wider(id_cols=GEOID,names_from = var, values_from = value)
pca.matrix.bg <- bg_data_wide %>% select(-GEOID)
pca.bg <- prcomp(pca.matrix.bg, scale. = TRUE, center = TRUE, rank. = 3) # data already imputed, so no NA handling needed
bg_data_wide$PC1 <- pca.bg$x[,1]
bg_data <- pivot_longer(bg_data_wide,!GEOID,names_to = "var",values_to = "value")
alt_data_wide <- alt_data %>% distinct(alt_polygon_id,var,.keep_all=TRUE) %>% pivot_wider(id_cols=alt_polygon_id,names_from = var, values_from = value)
pca.matrix.alt <- alt_data_wide %>% select(-alt_polygon_id)
pca.alt <- prcomp(pca.matrix.alt, scale. = TRUE, center = TRUE, rank. = 3)
alt_data_wide$PC1 <- pca.alt$x[,1]
alt_data <- pivot_longer(alt_data_wide,!alt_polygon_id,names_to = "var",values_to = "value")
v$var <- paste0("census_",v$var)
alt_data <- left_join(alt_data,v,by="var")
colnames(alt_polygons)[1]<-"geo_id"
colnames(bg_polygons)[1]<-"geo_id"
write_sf(alt_polygons,"finalData/alternate.geojson",delete_dsn=TRUE)
write_sf(bg_polygons,"finalData/block_group.geojson",delete_dsn=TRUE)
overall <- left_join(bg_data,select(bg,GEOID,B01001_001E),by="GEOID")
overall <- st_as_sf(overall)
overall$area_bg <- st_area(overall)
overall <- st_intersection(overall,select(boundary.p,OBJECTID))
overall$area_intersection <- st_area(overall)
overall$area_share <- overall$area_intersection/overall$area_bg
overall$wgt <- (overall$area_share * overall$B01001_001E) / sum(bg$B01001_001E)
overall$value <- overall$value * overall$wgt
overall <- overall %>% st_drop_geometry() %>% group_by(var) %>% summarise(value=sum(value))
n <- boundary %>% select(OBJECTID)
colnames(n)[1]<-"geo_id"
write_sf(n,"finalData/service_area.geojson",delete_dsn=TRUE)
First, we create a vector of the indicator variables that are in the data dictionary, and then create an empty data frame encompassing all polygons and these variables, iterated over years of choice.
year <- c(2010,2020)
var <- c("pillar1_hbi",
"pillar1_ppi",
"pillar1_affordability",
"pillar1_cutoff",
"pillar1_cap",
"pillar1_connections_water",
"pillar1_connections_sewer",
"pillar2_workforce",
"pillar2_diversity_1",
"pillar2_procurement",
"pillar2_training",
"pillar2_park_distance",
"pillar2_park_walk_access",
"pillar2_waterfront_acess",
"pillar3_breaks",
"pillar3_sewer_failures",
"pillar3_lead_lines",
"pillar3_lead_line_replacement",
"pillar3_proactive_maintenance",
"pillar3_outreach",
"pillar3_meetings",
"pillar3_water_board",
"pillar3_flooding",
"pillar3_complaints"
)
geo_type <- c("block_group",alt_name,"service_area")
geo_id <- unique(bg_data$GEOID)
geo_id <- as.data.frame(c(geo_id,unique(alt_data$alt_polygon_id)))
geo_id <- rbind(geo_id,name)
colnames(geo_id)[1] <- "geo_id"
pillars <- expand_grid(geo_id,year,var)
l_id <- length(geo_id[,1])
l_bg <- length(unique(bg_data$GEOID))
l_alt <- length(unique(alt_data$alt_polygon_id))
l_vars <- length(var)
l_year <- length(year)
l_bg <- l_bg*l_vars*l_year
l_alt <- l_alt*l_vars*l_year
pillars$geo_type <- "block_group"
pillars$geo_type[(l_bg+1):(l_bg+l_alt)] <- alt_name
pillars$geo_type[(l_bg+l_alt+1):length(pillars$geo_type)] <- "service_area"
pillars$units <- "%"
pillars$units[which(pillars$var %in% c("pillar1_hbi",
"pillar2_workforce",
"pillar2_public_facing",
"pillar3_breaks",
"pillar3_sewer_failures"))] <- "ratio"
Below, we populate the indicator data model with random uniform draws. This step should be replaced by a full workflow that imports submitted data and calculates the variables.
pillars <- pillars %>% group_by(var,year,geo_id) %>% mutate(value=runif(1))
pillars$value[which(pillars$units=="%")] <- 100*pillars$value[which(pillars$units=="%")]
Below, we calculate the percent change year-over-year in the indicators for each polygon, where applicable
pillars <- pillars %>%
arrange(geo_id,var,year) %>%
group_by(geo_id,var) %>%
mutate(delta_percent = 100*(value - lag(value))/lag(value)) %>%
ungroup()
pillars <- select(pillars,geo_id,geo_type,year,var,value,units,delta_percent)
pillars$units[which(pillars$var=="pillar1_affordability")] <- "character"
pillars$value[which(pillars$var=="pillar1_affordability")] <- sample(c("Low","Medium","High"),length(pillars$var[which(pillars$var=="pillar1_affordability")]),replace=TRUE)
pillars$delta_percent[which(pillars$units=="character")]<-NA
Here, we demonstrate how a "bucket" can be specified a priori for each variable and then used to assign each value to the appropriate bucket in the data. Here we use the fuzzyjoin R package, but the equivalent logic is expressed in this SQL query:
SELECT vars, value, bucket_names
FROM pillars
LEFT JOIN buckets ON value BETWEEN lower_bound AND upper_bound
buckets <- read_csv("data/buckets.csv") %>% select(vars,bucket_names,lower_bound,upper_bound)
## New names:
## * `` -> ...1
## Rows: 69 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): vars, bucket_names
## dbl (3): ...1, lower_bound, upper_bound
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
pillars$val_numeric <- as.numeric(pillars$value)
## Warning: NAs introduced by coercion
library(fuzzyjoin)
# use the fuzzyjoin package for a range (interval) join
pillars<-fuzzy_left_join(pillars, buckets,
by=c("val_numeric"="lower_bound", "val_numeric"="upper_bound"),
match_fun=list(`>=`, `<=`)) %>%
select(geo_id:units,bucket_names,delta_percent) %>%
rename(var_bucket=bucket_names)%>%
distinct(.keep_all=TRUE)
Below, we calculate an indicator for which quartile a given polygon's percent change (delta_percent) falls into.
qe <- pillars %>%
group_by(geo_type,var,year) %>%
summarise(delta_quartile=quantile(delta_percent,
probs=c(0.25,0.5,0.75),na.rm=TRUE),
q=c("p25","p50","p75")) %>%
pivot_wider(names_from=q, values_from=delta_quartile)
## `summarise()` has grouped output by 'geo_type', 'var', 'year'. You can override using the `.groups` argument.
qe <- left_join(pillars,qe,by=c("geo_type","var","year"))
qe$delta_quartile = NA
qe <- qe %>% mutate(
delta_quartile = case_when(delta_percent <= p25 ~ 1,
delta_percent > p25 & delta_percent <= p50 ~ 2,
delta_percent > p50 & delta_percent <= p75 ~ 3,
delta_percent > p75 ~ 4)
)
Below, we isolate only the columns of interest and export the indicator data to csv.
pillars <- qe %>% select(geo_id,geo_type,year,var,units,value,var_bucket,delta_percent,delta_quartile)
write_csv(pillars,"finalData/indicators.csv")
Below, we combine the block group, alternate polygon, and overall aggregated census data into one long-form data frame.
bg_data2 <- left_join(bg_data,v,by="var")
alt_data$geo_type <- alt_name
alt_data$geo_id <- alt_data$alt_polygon_id
alt_data <- select(alt_data,geo_id,geo_type,var,value,units)
bg_data2$geo_type <- "block_group"
bg_data2$geo_id <- bg_data2$GEOID
bg_data2 <- select(bg_data2,geo_id,geo_type,var,value,units)
overall$geo_id <- name
overall$geo_type <- "service_area"
overall$value <- as.numeric(overall$value)
overall2 <- left_join(overall,v,by="var")
overall2 <- select(overall2,geo_id,geo_type,var,value,units)
census <- bind_rows(bg_data2,alt_data,overall2)
census$year <- 2019
census <- select(census,geo_id,geo_type,year,var,value,units)
Below, we add quartiles for each census variable.
qc <-census %>%
group_by(geo_type,var,year) %>%
summarise(var_quartile=quantile(value,
probs=c(0.25,0.5,0.75),na.rm=TRUE),
q=c("p25","p50","p75")) %>%
pivot_wider(names_from=q, values_from=var_quartile)
## `summarise()` has grouped output by 'geo_type', 'var', 'year'. You can override using the `.groups` argument.
qc <- left_join(census,qc,by=c("geo_type","var","year"))
qc$var_quartile = NA
qc <- qc %>% mutate(
var_quartile = case_when(value <= p25 ~ 1,
value > p25 & value <= p50 ~ 2,
value > p50 & value <= p75 ~ 3,
value > p75 ~ 4)
)
Below, we isolate only the columns of interest and export the census data to csv.
census <- qc %>% select(geo_id,geo_type,year,var,units,value,var_quartile)
write_csv(census,"finalData/census.csv")
For Districts, we see that the first component alone explains 45% of the variance in the specified variables. Higher values on this component are associated with lower incomes, higher poverty, more public assistance receipt and unemployment, lower rents and home values, more blue-collar employment, higher proportions of minority groups, lower homeownership rates, less vehicle ownership, and a higher prevalence of single-parent families, while being relatively uninformed by linguistic isolation, education levels, or household crowding.
kable(summary(pca.alt)$importance[1:3,1:3],digits=2)
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| Standard deviation | 2.76 | 2.03 | 1.39 |
| Proportion of Variance | 0.45 | 0.24 | 0.11 |
| Cumulative Proportion | 0.45 | 0.69 | 0.81 |
kable(pca.alt$rotation,digits=2)
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| census_homeValuemedian | -0.27 | 0.30 | 0.14 |
| census_income_median | -0.35 | -0.11 | 0.05 |
| census_income_percapita | -0.32 | 0.13 | 0.20 |
| census_percAgeOver64 | 0.00 | -0.16 | 0.42 |
| census_percBelow200FPL | 0.35 | 0.04 | -0.12 |
| census_percBlueCollar | 0.30 | -0.20 | -0.22 |
| census_percEducLessHS | -0.11 | 0.39 | -0.26 |
| census_percFamiliesSingleParent | 0.32 | 0.02 | 0.04 |
| census_percHH_Crowded | -0.04 | 0.40 | 0.10 |
| census_percHH_LangIsolated | 0.05 | 0.38 | -0.39 |
| census_percHH_NoVehicle | 0.28 | 0.25 | 0.21 |
| census_percHomeowners | -0.19 | -0.38 | -0.06 |
| census_percMinority | 0.20 | 0.15 | 0.47 |
| census_percPlumbingComplete | -0.09 | -0.33 | -0.15 |
| census_percUnemployed | 0.25 | -0.11 | 0.36 |
| census_rentIncomeRatioMedian | 0.26 | 0.02 | 0.00 |
| census_rentMedian | -0.28 | 0.07 | 0.22 |
For block groups, we see that the first component alone explains only 38% of the variance in the specified variables (compared to 45% in the District-level analysis). However, the variables contribute to the first component in a pattern similar to the District-level analysis, with the sign reversed: here, higher PC1 values indicate less deprivation.
kable(summary(pca.bg)$importance[1:3,1:3],digits=2)
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| Standard deviation | 2.55 | 1.30 | 1.11 |
| Proportion of Variance | 0.38 | 0.10 | 0.07 |
| Cumulative Proportion | 0.38 | 0.48 | 0.56 |
kable(pca.bg$rotation,digits=2)
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| census_percBelow200FPL | -0.36 | 0.11 | -0.04 |
| census_percEducLessHS | -0.03 | 0.53 | 0.10 |
| census_percMinority | -0.29 | -0.02 | 0.11 |
| census_income_median | 0.36 | 0.00 | -0.04 |
| census_income_percapita | 0.34 | 0.10 | 0.21 |
| census_rentMedian | 0.25 | 0.20 | -0.05 |
| census_rentIncomeRatioMedian | -0.21 | -0.06 | 0.02 |
| census_homeValuemedian | 0.26 | 0.37 | 0.23 |
| census_percHomeowners | 0.25 | -0.39 | -0.09 |
| census_percBlueCollar | -0.29 | -0.18 | -0.16 |
| census_percUnemployed | -0.16 | -0.18 | 0.43 |
| census_percPlumbingComplete | 0.07 | -0.18 | -0.48 |
| census_percFamiliesSingleParent | -0.25 | -0.02 | 0.03 |
| census_percHH_NoVehicle | -0.30 | 0.09 | 0.31 |
| census_percAgeOver64 | 0.11 | -0.25 | 0.49 |
| census_percHH_LangIsolated | -0.16 | 0.43 | -0.30 |
| census_percHH_Crowded | -0.02 | 0.10 | 0.04 |
Below, we visualize how the first component (PC1) can be used as a multidimensional socioeconomic status/deprivation index that incorporates more information, and yields different results, than simply using income, poverty levels, or minority group prevalence.
bgm <- left_join(bg,bg_data_wide,by="GEOID")
m.pc.bg <- mapview(bgm,zcol="PC1", layer.name="Block Groups \n PC1") + mapview(boundary,alpha.regions=0,col.regions="red",color="red",layer.name="BSA", lwd=2)
m.pov.bg <- mapview(bgm,zcol="census_income_median", layer.name="Block Groups \n Median HH Income") + mapview(boundary,alpha.regions=0,col.regions="red",color="red",layer.name="BSA", lwd=2)
mapl <- sync(m.pc.bg, m.pov.bg)
mapl
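One caveat when mapping PC1: the sign of a principal component is arbitrary, so `prcomp()` may return PC1 with either orientation. A common convention is to flip the scores so that higher values mean higher deprivation, by aligning PC1 with a reference variable. A sketch, where `align_pc1` is a hypothetical helper and the commented usage assumes the block-group objects above:

```r
# Hypothetical helper: orient PC1 scores so they correlate positively with a
# reference deprivation variable (e.g. % below 200% of the federal poverty level).
align_pc1 <- function(scores, reference) {
  # flip the scores if they are negatively correlated with the reference
  if (cor(scores, reference, use = "complete.obs") < 0) -scores else scores
}
# e.g. bgm$PC1 <- align_pc1(bgm$PC1, bgm$census_percBelow200FPL)
```

This keeps map legends and cross-geography comparisons consistent even when separate PCA runs return opposite orientations.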